‘White Wine Quality’ is a tidy dataset containing 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10)
Which chemical properties influence the quality of white wines?
In this study, I perform an exploratory data analysis to understand the data before any inferential statistical tests are applied.
The structure of the data:
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
All of the variables are numeric (quality is stored as an integer); there are no factor variables in the dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Quality values range from 3 to 9. The median (6) and the mean (5.878) are very close to each other, which suggests that the distribution is not strongly skewed. Let’s start with the distribution of the quality scores.
The quality scores are mostly concentrated at a value of 6. There are very few wines with a quality score of 9. Since quality is an integer scale, there are no decimal values.
Let’s investigate other variables and their distributions.
There seem to be a few outliers.
The distribution of fixed acidity is very close to a normal distribution, but there are some outliers in the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
It can also be seen from the summary table and the outlier graph that there are few data points between the 3rd quartile and the maximum value.
If I trim the top 1 percentile, I obtain the graph below, which is approximately normal.
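The trimming step used throughout this report can be sketched roughly as follows. This is a minimal illustration, not the code behind the original graphs: the vector `fixed_acidity` is a synthetic stand-in for the real column, and the choice of `<=` at the cutoff is an assumption.

```r
# Synthetic, hypothetical stand-in for wine$fixed.acidity:
# a bell-shaped bulk plus a handful of extreme high values.
set.seed(42)
fixed_acidity <- c(rnorm(990, mean = 6.8, sd = 0.8),
                   runif(10, min = 10, max = 14))

# 99th percentile of the variable (R's default quantile definition).
cutoff <- quantile(fixed_acidity, probs = 0.99)

# "Trim the top 1 percentile": keep observations at or below the cutoff.
trimmed <- fixed_acidity[fixed_acidity <= cutoff]

summary(fixed_acidity)  # maximum is pulled up by the extreme values
summary(trimmed)        # maximum is now bounded by the cutoff
```

With a trimmed vector like this, the histogram is no longer stretched by a few extreme points, which is why the trimmed graphs look closer to a bell shape.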
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
In the dataset, there are extreme data points which make the distribution skewed. I should investigate these extreme points and their possible relationship with quality. The outlier graph shows that there are many outliers, which can also be observed in the histogram.
If I trim the top 1 percentile, then I obtain the below graph.
The distribution is still right skewed.
In the citric acid feature, there are many high and low outlier values, so trimming the outliers would help considerably. It is interesting to see that there are extreme counts at 0.3, which seems to be the peak, and at around 0.5. Apart from the spike near 0.5, the citric acid values have a bell-shaped distribution. Extra attention should be given to the 0.5 point.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid also has extremely high values. If the top 1 percentile is trimmed, I obtain the graph below, which is much closer to a normal distribution:
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
0.49 is a common citric acid value (215 wines), although the neighbouring values are not common.
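A frequency table at the data's own precision is what exposes a spike like the one at 0.49. The sketch below reproduces the idea on synthetic data; the vector `citric` and the size of the planted spike are assumptions chosen for illustration, not the real measurements.

```r
# Synthetic, hypothetical citric acid values recorded to two decimals:
# a bell-shaped bulk around 0.32 plus an artificial cluster at 0.49.
set.seed(1)
citric <- c(pmax(rnorm(2000, mean = 0.32, sd = 0.08), 0),
            rep(0.49, 150))

# Work in integer hundredths so that equality tests are exact.
hundredths <- as.integer(round(citric * 100))

# One count per distinct recorded value (the same idea as table() on the column).
counts <- table(hundredths)

# Compare the suspected spike with its immediate neighbours.
spike      <- sum(hundredths == 49L)
neighbours <- c(sum(hundredths == 48L), sum(hundredths == 50L))

# Most frequent recorded value, in hundredths.
mode_value <- as.integer(names(counts)[which.max(counts)])
```

A histogram with wide bins would average the spike into its neighbours, which is why a bin width matching the recording precision (0.01 here) matters.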
From the outlier graph we can see a few high outlier points. Therefore, trimming these outliers should give better results.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar distribution is highly skewed. There are a few extremely high values and a large number of very low values, although the low values are not outliers.
If the residual.sugar values are trimmed, the graph below is obtained:
It is interesting to see that the distribution is multimodal, so wines with quite different residual sugar levels exist. One group has very little residual sugar (around 1.5), one is sweet (around 5), and one is sweeter still (around 8). Most probably, people prefer either sweet or non-sweet wines.
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
The box plot shows a huge number of outliers, which means the distribution is highly skewed. The skewness can also be seen in the histogram. However, it is difficult to read the graph with these bin sizes. I should narrow the bins and change the x scale to obtain a better visualization. Let’s look at the summary table of the chlorides variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distribution looks reasonable, but the spread of the data is wide. If I omit the top 1 percent of the data:
Most of the data points are concentrated around 0.05 (the 3rd quartile), but there is a considerable amount of data above 0.05, including a spread of values greater than 0.10.
##
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02 0.021 0.022
## 1 1 1 4 4 5 5 10 9 16 19 19
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034
## 20 34 30 54 58 85 81 108 107 109 119 168
## 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044 0.045 0.046
## 130 200 160 167 157 182 147 184 141 201 170 181
## 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058
## 171 174 133 170 115 104 130 99 61 88 68 53
## 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07
## 36 46 19 25 23 15 8 18 18 7 18 6
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082
## 5 2 5 8 2 9 1 2 4 4 2 2
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094
## 5 5 3 4 3 2 1 2 1 3 3 5
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108 0.11 0.112 0.114
## 2 6 1 3 1 1 1 1 2 3 1 1
## 0.115 0.117 0.118 0.119 0.12 0.121 0.122 0.123 0.126 0.127 0.13 0.132
## 1 3 1 3 1 2 1 4 3 2 1 1
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149
## 1 1 1 2 2 3 1 1 1 2 1 1
## 0.15 0.152 0.154 0.156 0.157 0.158 0.16 0.167 0.168 0.169 0.17 0.171
## 1 2 1 1 4 1 2 2 3 2 2 1
## 0.172 0.173 0.174 0.175 0.176 0.179 0.18 0.184 0.185 0.186 0.194 0.197
## 2 2 2 2 2 1 1 2 2 1 1 2
## 0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239 0.24 0.244 0.255
## 1 2 1 2 1 1 1 1 1 1 1 1
## 0.271 0.29 0.301 0.346
## 1 1 1 1
There are several data points above 0.10, and these data points have a large spread.
As with most other features, there are many outliers. We should trim the outliers to make a better analysis. First, let’s adjust the bin widths to obtain deeper insight.
If we also look at the summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
As with the other variables, there are extremely large values. If the top 1 percentile is omitted:
This time, the distribution is much better and similar to normal. Only a very small amount of the data has extreme values, and the skewness is very low.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The data follow a pattern similar to the previous variables. There are extremely large values and a few outliers, but most of the data has a bell-shaped, normal-like distribution. If the top 1 percentile is omitted, the distribution below is obtained.
Most of the data values are between 50 and 240.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The spread of density seems very narrow. There are nearly no outliers. Let’s use smaller bin sizes:
Most density values are between 0.98 and 1. Let’s omit the top 1 percentile.
Most of the density data are concentrated between 0.990 and 0.997.
The data distribution seems bell-shaped, but there are several outliers. Because the outliers are on both sides, the distribution is not skewed. If the bin size is narrowed:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
There is negligible skewness in the data, and the spread is quite narrow.
The data seem to be slightly right skewed. However, there are very few outliers. Let’s narrow the bin sizes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
If the top 1 percentile is omitted:
There is still some right skewness in the data, but the distribution is very close to bell-shaped.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
There are nearly no outliers. The distribution seems to be multimodal, with groups at (8.5-10), (10-11.5) and (11.5-13). The biggest group is (8.5-10). The highest count is at 9.5, nearly twice the height of the other peaks.
##
## 8 9 10 11 12 13 14
## 317 1606 1256 906 675 131 7
There are 4,898 observations and 12 variables in the dataset. All variables are numeric, and quality is an integer type. Most variables have right-skewed distributions with extremely high values.
Sulphates and density also follow this pattern: a bell-shaped distribution with a few extremely high values.
Alcohol has 3 main clusters around 9.5, 10.5 and 12.5.
The main features in the data set are alcohol, volatile acidity and residual sugar. I suspect that those features, and some combinations of the other variables, can be used to build a predictive model.
investigation into your feature(s) of interest?
I think that sulfur dioxide, sulphates and chlorides values are also important factors to be investigated in the future.
No, I did not.
form of the data? If so, why did you do this?
I performed several operations on the data set. I log-transformed the skewed distribution of residual sugar to visualize the behavior of the data better. I also trimmed the top 1 percentile of the data points of most variables to visualize the common pattern in the data.
I experimented with bin sizes to visualize the distributions clearly and identify the patterns. Generally, I set the bin sizes according to the precision of the data points in order to capture the details. In this way, I detected the cluster of data around 0.49 in the citric acid feature.
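The effect of the log transformation mentioned above can be illustrated with a small sketch. The data here are synthetic (a log-normal sample standing in for residual sugar), and base 10 is an assumption; the report does not state which base was used.

```r
# Synthetic, hypothetical right-skewed variable standing in for residual.sugar.
set.seed(7)
residual_sugar <- rlnorm(1000, meanlog = log(5), sdlog = 0.9)

# Log-transform (base 10 is an assumed choice).
log_sugar <- log10(residual_sugar)

# A simple skewness indicator: the gap between mean and median.
gap_raw <- mean(residual_sugar) - median(residual_sugar)
gap_log <- mean(log_sugar) - median(log_sugar)

# Histogram counts on the log scale, computed without drawing a plot.
h <- hist(log_sugar, breaks = 30, plot = FALSE)
```

On the raw scale the mean sits well above the median; after the transformation the two nearly coincide, so the bulk of the distribution and any separate modes become easier to see.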
There are different quality levels for the same amount of sugar. Let’s investigate quality conditional on residual.sugar.
Average quality has very high variance conditional on residual.sugar. For very close residual.sugar values, quality changes a lot, which indicates a very low correlation. However, on average, extreme residual.sugar values are associated with lower quality.
When residual sugar is between 1.5 and 5, quality is stable and the mean of means is at its highest.
When residual sugar is between 5 and 10, the variance in quality is very high and the quality mean reaches very high values. However, the mean of means is quite low.
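One possible reading of "mean" and "mean of means" is sketched below: first the average quality at each distinct residual.sugar value, then the average of those averages within a sugar range. This interpretation, the synthetic data frame `wine`, and the range breaks are all assumptions made for illustration.

```r
# Synthetic, hypothetical data with the same column names as the wine dataset.
set.seed(3)
wine <- data.frame(
  residual.sugar = round(runif(500, min = 0.6, max = 20), 1),
  quality        = sample(3:9, 500, replace = TRUE)
)

# Step 1: mean quality conditional on each distinct residual.sugar value.
by_sugar <- aggregate(quality ~ residual.sugar, data = wine, FUN = mean)
names(by_sugar)[2] <- "mean_quality"

# Step 2: group the sugar values into ranges (assumed breaks) ...
by_sugar$sugar_range <- cut(by_sugar$residual.sugar,
                            breaks = c(0, 1.5, 5, 10, Inf))

# ... and average the conditional means within each range ("mean of means").
mean_of_means <- tapply(by_sugar$mean_quality, by_sugar$sugar_range, mean)

# Spread of the conditional means within each range.
sd_of_means <- tapply(by_sugar$mean_quality, by_sugar$sugar_range, sd)
```

Comparing `mean_of_means` and `sd_of_means` across ranges gives a numeric counterpart to the visual statements about where quality is stable and where it fluctuates.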
When we look at the mean and the mean of means, we can see that there is a pattern in quality conditional on alcohol. If the extreme values are trimmed:
The trimmed data show a clearer positive linear relationship between 9.5 and 13. The best and most stable quality values are reached between alcohol levels of 12 and 13.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
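The t statistic in the output above follows directly from the correlation and the degrees of freedom via t = r * sqrt(df / (1 - r^2)). The sketch below recomputes it from the reported numbers and then checks the same identity against `cor.test` on synthetic data; the vectors `x` and `y` are illustrative and are not the wine data.

```r
# Values reported in the output above.
r  <- 0.4355747
df <- 4896

# t statistic for Pearson's correlation test.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

# Same identity checked on synthetic data.
set.seed(11)
x  <- rnorm(200)
y  <- 0.4 * x + rnorm(200)
ct <- cor.test(x, y)                      # Pearson is the default method
r2 <- unname(ct$estimate)
t2 <- r2 * sqrt(unname(ct$parameter) / (1 - r2^2))
same_t <- isTRUE(all.equal(unname(ct$statistic), t2))
```

With 4,896 degrees of freedom even a moderate correlation yields a very large t value, so the tiny p-value indicates that the correlation is not zero rather than that it is strong.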
From the graph, it can be seen that there is a negative relationship between volatile acidity and quality. Let’s investigate the data without the extreme points.
Trimming the extremely high points decreased the slope; however, a negative relationship can still be observed clearly. Above a volatile acidity of 0.5, the slope (the strength of the relationship) increases.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
Quality and free.sulfur.dioxide have a positive relationship between 0 and 30. After 40, the mean of means decreases and falls to a quality level of 6.
total.sulfur.dioxide seems to have a positive relationship with quality between 0 and 90. The slope of the curve becomes negative after 100, but the strength of the relationship is low. It is important to emphasize that the quality value is very stable between 75 and 150. For small values of total sulfur dioxide, quality is very volatile.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Chlorides values are mostly concentrated between 0 and 0.1. If we look at the summary table:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Even the third quartile is 0.05. If we trim the extreme values and plot quality conditional on chlorides:
It is clear that quality is very stable between 0.025 and 0.075, with a negative relationship with chlorides. The volatility increases after 0.10.
It is difficult to say visually that there is a relationship between quality and sulphates. The only notable increase is around a sulphates value of 0.8.
We can also look at citric acid, fixed acidity, density and pH relationships.
##
## Pearson's product-moment correlation
##
## data: density and quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
As expected, quality does not seem to vary conditional on pH and fixed acidity. However, quality seems to be related to density and citric acid. In particular, denser wines seem to have a lower quality value on average.
residual.sugar, alcohol, volatile.acidity, total.sulfur.dioxide, citric acid and density should be given more importance.
Let’s look at a few variables conditional on alcohol.
There is a clear decreasing trend of residual sugar between alcohol levels of 8 and 10.
The relationship shows that for alcohol values above 11, volatile acidity increases.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and alcohol
## t = 14.107, df = 1559, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2917104 0.3797308
## sample estimates:
## cor
## 0.3364553
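The 1,559 degrees of freedom in the output above (against 4,896 for the full dataset) indicate that this test ran on a subset of 1,561 rows, which fits the restriction to alcohol values above 11. A minimal sketch of such a subset test, on synthetic data with assumed column names and an assumed relationship, might look like this:

```r
# Synthetic, hypothetical data: volatile acidity rises with alcohol only above 11.
set.seed(5)
n    <- 1000
wine <- data.frame(alcohol = runif(n, min = 8, max = 14))
wine$volatile.acidity <- 0.25 +
  ifelse(wine$alcohol > 11, 0.03 * (wine$alcohol - 11), 0) +
  rnorm(n, sd = 0.05)

# Restrict to the higher-alcohol wines before testing.
high <- subset(wine, alcohol > 11)

# Pearson correlation test on the subset only.
res <- cor.test(high$volatile.acidity, high$alcohol)

# Degrees of freedom equal the number of rows in the subset minus 2.
df_subset <- unname(res$parameter)
```

Because the degrees of freedom always equal the row count minus 2, they are a quick way to confirm which subset a reported test was run on.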
There is a decreasing total.sulfur.dioxide trend for increasing alcohol levels.
##
## Pearson's product-moment correlation
##
## data: alcohol and total.sulfur.dioxide
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4709775 -0.4262443
## sample estimates:
## cor
## -0.4488921
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Density and alcohol have a strong negative correlation (-0.78) as seen in the above graph and correlation calculation.
High quality wines generally have high alcohol levels.
Low and high quality wines contain similar amounts of sugar, and the data points are quite volatile. It is difficult to detect a clear pattern.
High quality wines clearly have low density on average.
It is difficult to detect a trend for high quality wines and volatile acidity when extremes are trimmed.
Free sulfur dioxide is present in similar amounts at different quality levels. Wines of different quality have similar amounts of citric acid. However, the highest quality (9) wines have a noticeably higher amount of citric acid.
Wines of different quality have similar amounts of sulphates and fixed acidity.
The bivariate analysis showed that, other than alcohol, none of the variables has a direct linear relationship with quality. However, some variables are related to quality and to each other. Generally, it can be observed that different mixtures of inputs can lead to either a high or a low quality value.
the investigation. How did the feature(s) of interest vary with other features
in the dataset?
Alcohol, volatile acidity and residual sugar were the primary features of interest.
Alcohol was found to have a positive correlation (0.43) with quality. When smoothed, the relationship becomes much clearer and can be seen in the graph. From the box plot by quality, high quality wines were observed to have higher alcohol values.
Volatile acidity was found to have a statistically significant negative correlation (-0.195) with quality. If smoothed, the relationship is easier to observe. The box plot shows that high quality levels generally have lower volatile acidity. However, this relationship is not strong.
I was expecting a strong relationship between quality and residual.sugar. However, the relationship is not very strong. The only observation is that low sugar levels are found in both high quality and low quality wines, while mid-quality wines contain high residual.sugar.
Although density was not a primary interest, it was found to have a negative relationship with quality and a significant correlation (-0.307). From the box plot, it can be observed that high quality wines are less dense when the data are trimmed.
When the relationship between alcohol and density was investigated, it was found that they have a quite strong correlation (-0.78). Of course, a high alcohol level means a less dense wine.
The graph of alcohol and total.sulfur.dioxide showed a relatively strong relationship and a negative linear correlation (-0.45). On the other hand, alcohol and volatile.acidity were found to have a positive relationship (0.34) for alcohol values higher than 11.
As expected, the strongest relationship was found between the alcohol and density variables.
When I add quality_grouped to the alcohol-residual.sugar graph, I observe that high quality wines (dark blue points) generally have a high alcohol level.
An important implication of the graph is that high quality wines generally have a low residual.sugar level (less than 5). However, a low level of sugar does not imply a high quality wine. There are many wines which are low quality and contain low residual.sugar.
From the above graphs, it can be observed that most low quality wines combine low alcohol with high density, and most high quality wines combine low density with a high alcohol level. The medium quality wines are dispersed all over the graph.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
It can be clearly seen that high quality wines mostly have a citric acid level above the median. More precisely, high quality wines are distributed between the median and the 3rd quartile of the citric.acid values. Almost none of the high quality wines have low citric acid values.
The graph implies that high quality wines have a high alcohol level and low volatile acidity. An important implication is that the quality difference conditional on alcohol becomes more pronounced as volatile acidity increases.
There is a negative relationship between quality and chlorides. Low chlorides combined with a high alcohol level indicate good quality.
High quality wines are generally concentrated below a total.sulfur.dioxide value of 150.
In multivariate analysis, quality and alcohol values are grouped and factorized to add one more dimension into visualization.
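The grouping step can be sketched with `cut()`, which turns a numeric variable into a factor of ranges. The name `quality_grouped` appears in this report; `alcohol_grouped`, the break points, the labels and the synthetic data frame are assumptions made for illustration.

```r
# Synthetic, hypothetical data with the dataset's column names.
set.seed(9)
wine <- data.frame(
  quality = sample(3:9, 300, replace = TRUE),
  alcohol = round(runif(300, min = 8, max = 14.2), 1)
)

# Group quality into three ordered levels (assumed breaks: 3-4, 5-6, 7-9).
wine$quality_grouped <- cut(wine$quality,
                            breaks = c(2, 4, 6, 9),
                            labels = c("low", "medium", "high"))

# Group alcohol into three ranges (assumed breaks).
wine$alcohol_grouped <- cut(wine$alcohol,
                            breaks = c(8, 10, 12, 15),
                            include.lowest = TRUE,
                            labels = c("8-10", "10-12", "12+"))

# Cross-tabulate the two factors.
group_counts <- table(wine$quality_grouped, wine$alcohol_grouped)
```

Factors like these can then be mapped to color or facets in a scatter plot, which is how a third dimension is added to a two-variable graph.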
High quality wines were found to have a high level of alcohol, a low level of residual.sugar, low density, high citric acid, low chlorides, low sulfur dioxide and low volatile acidity.
The third dimension improved the quality of the visualization and the pattern detection. Since the relationships were generally non-linear, grouping the alcohol and quality values provided further insight into the graphs and the analysis. I was expecting ‘taste clusters’ in the dataset, and I was able to find these clusters for high quality and low quality wines.
The feature values of medium quality wines are generally dispersed all over the data. Therefore, further analysis should be conducted to detect detailed patterns.
The interaction between alcohol and residual.sugar was surprising. Wines with a high level of alcohol had a low level of residual sugar, contrary to my expectations.
Alcohol and quality have a positive relationship (above a quality of 5), and the relationship is very close to linear. Although there are fewer data points, there is a negative relationship for quality values of 3 to 5. However, when the extremes are trimmed as in the first graph, it is easier to observe the trend.
The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high quality wines have a high alcohol level. Furthermore, it can be seen that the separation by alcohol increases at high volatile acidity.
The negative relationship between alcohol and residual sugar is detected. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases significantly as alcohol increases.
This analysis was conducted to explore the features of white wines and the relationships and interactions among them. The main purpose was to investigate the feature values in high quality and low quality wines. This study helped me to extract the main features of high quality and low quality wines. The middle quality wines generally do not have significant extreme values. Rather, the data points for the middle quality wines are dispersed all over the graph.
The multivariate analysis and the grouping of quality values helped me to see the data clusters easily.
Box plots of features by quality value helped to detect small differences among groups which were impossible to detect through line and point graphs. For instance, the citric acid value was noticeably high for quality-9 wines. I could not see this easily in the line graph.
Although some features take the same values in high quality and low quality wines, the alcohol, citric acid, chlorides, residual sugar and density levels are quite different. From this study, common properties of high quality wines can be extracted easily. If a company would like to produce high quality wine, it could use the findings as a blueprint.
Some limitations made the interpretations more difficult. The 10-point scale may be an important limitation. The differences between the wine types are squeezed between quality 3 and quality 9.
Future work should construct a model to classify a wine as low quality or high quality using the features.
There were no 10-point wines and no 1- or 2-point wines, which placed most wines in the middle quality range. This may be due either to rater bias or to floor and ceiling effects.